Energy consumption is a critical concern worldwide due to its impact on the environment, economy, and human welfare. Therefore, understanding the factors that influence energy consumption in buildings is essential to optimize energy use and minimize its negative effects. Multiple linear regression is a statistical method used to model the relationship between a dependent variable and several independent variables simultaneously. In this report, we perform a multiple linear regression analysis to investigate the factors that affect energy consumption. The analysis is based on a dataset that includes information on natural gas consumption and several variables related to weather conditions (such as the mean external temperature and the irradiance). The objective of this study is to identify the significant predictors of energy consumption and provide insights into the underlying mechanisms that drive energy use.
The report is organized in the following sections:
The dataset used in this analysis is composed of 3 numerical variables, total daily gas consumption Energy \([Smc]\), mean daily external temperature Text \([°C]\), and mean solar irradiance Iext \([W/m^2]\), and 1 categorical variable, the day of the week DayOfTheWeek.
The dataset provides daily measurements of these variables for a full heating season in Turin, which runs from \(1^{st}\) November 2017 to \(31^{st}\) March 2018, resulting in a total of 151 records.
The table below shows a sample of the dataset.
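As a quick sanity check on the record count, the calendar arithmetic can be reproduced in a few lines of R. This is a minimal sketch: the years 2017 and 2018 are taken from the date range in the summary output of this report, and the names `season` and `n_days` are illustrative.

```r
# Daily dates of the heating season (years taken from the dataset summary).
season <- seq(as.Date("2017-11-01"), as.Date("2018-03-31"), by = "day")

# One record per day, so the number of records equals the number of days.
n_days <- length(season)
print(n_days)  # 151
```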
The trend of the variables during the heating season is represented in the figure below.
For the following steps, it is useful to summarize the dataset in terms of descriptive statistics and distributions:
##       date             DayOfTheWeek       Text             Iext
##  Min.   :2017-11-01   Min.   :1.00   Min.   :-5.950   Min.   :  0.50
##  1st Qu.:2017-12-08   1st Qu.:2.00   1st Qu.:-0.115   1st Qu.:  3.48
##  Median :2018-01-15   Median :4.00   Median : 2.920   Median : 34.34
##  Mean   :2018-01-15   Mean   :4.04   Mean   : 3.103   Mean   : 41.23
##  3rd Qu.:2018-02-21   3rd Qu.:6.00   3rd Qu.: 6.605   3rd Qu.: 71.47
##  Max.   :2018-03-31   Max.   :7.00   Max.   :11.610   Max.   :182.10
##      Energy        day_name
##  Min.   :  0.0   Length:151
##  1st Qu.:257.1   Class :character
##  Median :389.2   Mode  :character
##  Mean   :382.2
##  3rd Qu.:556.2
##  Max.   :676.8
In this section, an outlier detection process is employed to identify values of the analyzed variables that lie far enough from the bulk of the data to lead to incorrect or misleading conclusions when developing a multiple regression model.
One way to proceed is to use Cook’s distance, a measure of the influence of each observation on a regression fit. It can be used to identify multivariate outliers even when the data are not normally distributed, as in our case, by examining its value for each observation. Large values of Cook’s distance indicate observations that have a disproportionate influence on the fitted model, possibly because they are outliers.
Cook’s distance is evaluated as:
\[D_i = \frac{\sum_{j=1}^n (\hat{y}_j - \hat{y}_{j(i)})^2}{p s^2}\] where \(\hat{y}_j\) is the prediction for observation \(j\) from the model fitted on all \(n\) observations, \(\hat{y}_{j(i)}\) is the prediction for observation \(j\) from the model fitted without observation \(i\), \(s^2\) is the mean squared error of the full model, and \(p\) is the number of regression coefficients (including the intercept).
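The definition can be checked numerically against R's built-in `cooks.distance()`. The sketch below is not the analysis code of this report: it uses a synthetic data frame `toy` (an assumption made only for illustration), takes the squared differences between the two sets of predictions, and takes \(p\) as the number of estimated coefficients including the intercept, which is the convention `cooks.distance()` follows for `lm` fits.

```r
set.seed(1)
n <- 40

# Synthetic stand-in for the real dataset (hypothetical values).
toy <- data.frame(Text = runif(n, -5, 12), Iext = runif(n, 0, 180))
toy$Energy <- 570 - 33 * toy$Text - 0.5 * toy$Iext + rnorm(n, sd = 45)

fit   <- lm(Energy ~ Text + Iext, data = toy)
p     <- length(coef(fit))                        # coefficients, intercept included
s2    <- sum(residuals(fit)^2) / df.residual(fit) # mean squared error of the full model
y_hat <- fitted(fit)                              # predictions from the full model

# Cook's distance from the definition: refit without observation i,
# predict all n observations, and sum the squared changes in prediction.
cook_manual <- sapply(seq_len(n), function(i) {
  fit_i   <- lm(Energy ~ Text + Iext, data = toy[-i, ])
  y_hat_i <- predict(fit_i, newdata = toy)
  sum((y_hat - y_hat_i)^2) / (p * s2)
})

# Compare with the built-in implementation (up to floating-point tolerance).
agree <- isTRUE(all.equal(cook_manual, unname(cooks.distance(fit))))
print(agree)
```

Refitting the model once per observation is only practical for small datasets; the built-in function obtains the same quantity from the residuals and leverages of a single fit.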
To better visualize the dataset, a 3D scatter plot is shown in the figure below, with points colored by DayOfTheWeek.
As can easily be seen, Sundays are days with no energy consumption, so they can be removed from the dataset before fitting the model to improve its accuracy.
Now we can fit the model and evaluate Cook’s distance:
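A minimal sketch of this filtering step is shown below. It rebuilds the calendar of the heating season rather than loading the real data, and the names `toy`, `is_sunday` and `data_no_sunday` are hypothetical. On the actual dataset the same filter could be written against the day-of-week columns, whose exact coding is not shown in this report.

```r
# Calendar of the heating season (dates from the dataset summary).
toy <- data.frame(
  date = seq(as.Date("2017-11-01"), as.Date("2018-03-31"), by = "day")
)

# as.POSIXlt()$wday codes the weekday as 0-6 with 0 = Sunday,
# independently of the system locale.
toy$is_sunday <- as.POSIXlt(toy$date)$wday == 0

# Keep every record that is not a Sunday.
data_no_sunday <- toy[!toy$is_sunday, , drop = FALSE]

print(sum(toy$is_sunday))    # 21 Sundays
print(nrow(data_no_sunday))  # 130 remaining records
```

The 130 remaining records match the number of observations used for the Cook’s distance threshold.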
A common rule of thumb for outlier detection with Cook’s distance is a threshold of \(4/n\), where \(n\) is the number of observations (130 once Sundays are removed). Records with a Cook’s distance higher than \(4/130 \approx 0.031\) are therefore considered outliers and removed from the dataset to make the model more accurate.
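The threshold and the flagging step can be sketched as follows. The data frame `toy` is synthetic, and the artificial outlier injected at row 10 is an assumption made only so that the example has something to flag; none of these values come from the real dataset.

```r
set.seed(42)
n <- 130  # number of observations stated in the text

# Synthetic stand-in for the dataset without Sundays (hypothetical values).
toy <- data.frame(Text = runif(n, -6, 12), Iext = runif(n, 0, 180))
toy$Energy <- 570 - 33 * toy$Text - 0.5 * toy$Iext + rnorm(n, sd = 45)
toy$Energy[10] <- toy$Energy[10] + 500  # inject one artificial outlier

fit <- lm(Energy ~ Text + Iext, data = toy)

# Rule-of-thumb threshold 4/n.
threshold <- 4 / n
print(round(threshold, 3))  # 0.031

# Row indices whose Cook's distance exceeds the threshold.
d <- cooks.distance(fit)
flagged <- which(d > threshold)
print(flagged)
```

The \(4/n\) rule is a heuristic, so flagged records are candidates for inspection and not proven errors.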
Let’s plot the Cook’s distances and the threshold identified:
As we can see, 4 outliers are identified using this criterion: 2017-11-13, 2018-01-11, 2018-02-20, and 2018-02-23.
Now we can remove these records and refit the linear regression model, evaluating its performance.
Once the data are cleaned, a linear regression model can be fitted using the external temperature \(T_{ext}\) and the solar irradiance \(I_{ext}\) as predictors.
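The removal and refit can be sketched as follows, again on synthetic data. Only the name `data_final` comes from the model call shown in this report; `toy`, `fit_all`, `flagged`, `fit_final` and the generated values are assumptions for illustration.

```r
set.seed(7)
n <- 130

# Synthetic stand-in for the dataset without Sundays (hypothetical values).
toy <- data.frame(Text = runif(n, -6, 12), Iext = runif(n, 0, 180))
toy$Energy <- 570 - 33 * toy$Text - 0.5 * toy$Iext + rnorm(n, sd = 45)

# First fit, used only to compute Cook's distance.
fit_all <- lm(Energy ~ Text + Iext, data = toy)
flagged <- which(cooks.distance(fit_all) > 4 / n)

# Drop the flagged rows. The guard matters: toy[-integer(0), ] would
# return an empty data frame instead of the full one.
data_final <- if (length(flagged) > 0) toy[-flagged, ] else toy

# Refit on the cleaned data.
fit_final <- lm(Energy ~ Text + Iext, data = data_final)
print(nrow(data_final))
print(df.residual(fit_final))  # rows minus 3 estimated coefficients
```

With the figures reported here, 130 records minus 4 outliers leaves 126, and 126 minus 3 estimated coefficients gives the 123 residual degrees of freedom of the final model.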
##
## Call:
## lm(formula = Energy ~ Text + Iext, data = data_final)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -88.589 -35.919  -9.473  38.182  92.225
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 567.6156     6.4319  88.250  < 2e-16 ***
## Text        -33.2272     0.9738 -34.123  < 2e-16 ***
## Iext         -0.5605     0.1072  -5.228 7.12e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.35 on 123 degrees of freedom
## Multiple R-squared: 0.909, Adjusted R-squared: 0.9076
## F-statistic: 614.6 on 2 and 123 DF, p-value: < 2.2e-16
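Reading the estimates off the output above, the fitted model is \(\widehat{Energy} = 567.6156 - 33.2272\,T_{ext} - 0.5605\,I_{ext}\). The short sketch below evaluates it for one day; the helper `predict_energy` and the input values (5 °C and 50 W/m²) are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the regression output above.
b0     <- 567.6156   # intercept [Smc]
b_text <- -33.2272   # change in Energy per degree C of Text
b_iext <- -0.5605    # change in Energy per W/m^2 of Iext

# Point prediction of daily gas consumption from the fitted equation.
predict_energy <- function(text, iext) {
  b0 + b_text * text + b_iext * iext
}

print(predict_energy(5, 50))  # 373.4546
```

Both slopes are negative, so warmer and sunnier days are associated with lower predicted gas consumption.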
As can easily be observed, the regression model fits the data well, with an adjusted \(R^2\) of 0.9076. In particular, we can observe the following features:
For completeness, we also show some diagnostic plots used to visualize the strength of the regression model.